show expected and problematic output produced by deviceQuery in GPU docs #139

Open
wants to merge 1 commit into base: main

Conversation

boegel
Contributor

@boegel boegel commented Dec 21, 2023

showing output in case it doesn't work is useful for searching purposes...

@boegel boegel added documentation Improvements or additions to documentation enhancement New feature or request labels Dec 21, 2023
...
```

If the `deviceQuery` command can not access your GPU, you will see an error message like:
Member

This shouldn't actually happen though. Because of the Lmod guards, the only scenario I can see where you would reach this is when you are using a container and the system drivers are too old.

Contributor Author

I triggered it by cleaning out the host_injections directory after loading the module.

I agree it's very unlikely that it happens, but we should mention it in the docs regardless, if only to let people easily find this page when searching for error messages.

Member

My concern is that the placement makes a failure seem likely, when reaching this message is actually very unlikely.

Member

Maybe a little box titled "What does it look like if the command fails?"

@@ -152,10 +152,32 @@ The only scenario where this would be required is if `$LD_LIBRARY_PATH` is modif

### Testing the GPU support {: #gpu_cuda_testing }
Collaborator

@casparvl casparvl Dec 22, 2023

Currently, this only covers testing whether you can run CUDA-enabled software from EESSI. Maybe we can also include a short instruction for testing whether building new CUDA software on top of EESSI works properly. Something like this:
First, create a file `hello_cuda.cu` with the contents:

#include <stdio.h>

__global__ void helloCUDA()
{
    printf("Hello, CUDA!\n");
}

int main()
{
    helloCUDA<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Then

module load CUDA/<some_version>
nvcc hello_cuda.cu -o hello_cuda
chmod u+x hello_cuda
./hello_cuda 

Collaborator

And mention that they should test this for each version of CUDA they installed in `host_injections`.
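
A minimal sketch of how that per-version check could be scripted, assuming `hello_cuda.cu` from the comment above is in the current directory and that `module` (Lmod) and `nvcc` are usable once the CUDA module is loaded. The function name `test_cuda_versions` and the version strings passed to it are hypothetical, not something EESSI provides. Because the example program returns 0 even if the kernel never ran, the sketch checks for the printed greeting rather than the exit code.

```shell
# Sketch: build and run hello_cuda.cu once per CUDA module version.
# test_cuda_versions is a hypothetical helper name; the versions are
# whatever you installed in host_injections, passed as arguments.
test_cuda_versions() {
    local failed=0
    local version
    for version in "$@"; do
        # Use a subshell so a loaded module does not leak into the next iteration.
        if ( module load "CUDA/${version}" >/dev/null 2>&1 \
             && nvcc hello_cuda.cu -o "hello_cuda_${version}" \
             && "./hello_cuda_${version}" | grep -q 'Hello, CUDA!' ); then
            echo "CUDA/${version}: OK"
        else
            echo "CUDA/${version}: FAILED"
            failed=1
        fi
    done
    # Non-zero if any version failed to load, compile, or print the greeting.
    return "$failed"
}
```

Usage would look like `test_cuda_versions <version_1> <version_2>`, with the placeholders replaced by the CUDA versions actually installed.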

Contributor Author

Makes sense, but that should be done in a separate PR?

Collaborator

If you want, sure. I won't block this one over it :) Although I would consider it to be an integral part of "Testing the GPU support" to be honest :)

Member

@ocaisa ocaisa Aug 9, 2024

I don't see it as that integral if we are focused on software consumers; it's only integral if you want to do development-type work.

Comment on lines +177 to +183
If the `deviceQuery` command can not access your GPU, you will see an error message like:
```
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
```
```
Member

Suggested change
If the `deviceQuery` command can not access your GPU, you will see an error message like:
```
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
```
```
!!! note "What if the `deviceQuery` command fails?"
If the `deviceQuery` command cannot access your GPU, you will see an error message like:
```
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
```
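
For anyone who wants to check this from a script rather than by eye, here is a rough sketch that classifies `deviceQuery` output by the `Result = ...` line and the error text quoted above. The helper name `check_device_query` and its messages are made up for illustration; it assumes a successful run ends with `Result = PASS`, as the CUDA sample normally does.

```shell
# Sketch: classify deviceQuery output read from stdin.
# check_device_query is a hypothetical helper, not part of EESSI or CUDA.
check_device_query() {
    local output
    output=$(cat)
    if printf '%s\n' "$output" | grep -q 'Result = PASS'; then
        echo "GPU accessible"
    elif printf '%s\n' "$output" | grep -q 'CUDA driver version is insufficient'; then
        # The error discussed in this PR: driver libraries missing or too old.
        echo "driver insufficient for CUDA runtime"
        return 1
    else
        echo "deviceQuery failed"
        return 1
    fi
}
```

It could be used as `deviceQuery | check_device_query`, with the exit status telling a job script whether to continue.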
